Abstract

We present a robust method which translates information on the speed of coming down
from infinity of a genealogical tree into sampling formulae for the underlying population.
We apply these results to population dynamics where the genealogy is given by a
$\Lambda$-coalescent. This allows us to derive an exact formula for the asymptotic behavior of the
site and allele frequency spectrum and the number of segregating sites, as the sample size
tends to $\infty$. Some of our results hold in the case of a general $\Lambda$-coalescent that comes
down from infinity, but we obtain more precise information under a regular variation
assumption. In this case, we obtain results of independent interest for the time at which
a mutation uniformly chosen at random was generated. This exhibits a phase transition
at $\alpha = 3/2$, where $\alpha \in (1, 2)$ is the exponent of regular variation.
1 Introduction and main results



Coalescents with multiple collisions, also known as $\Lambda$-coalescents, are a class of Markovian
coalescence models, introduced and first studied by Pitman [26] and independently by Sagitov
[28], and were already implicit in a contemporaneous work of Donnelly and Kurtz [11]. They
arise naturally as scaling limits for the genealogy of exchangeable population dynamics. This
connection to population genetics has motivated a large number of works around the study of
$\Lambda$-coalescents, an introduction to which may be found in the recent surveys 0(8] for instance.

The following question is natural in the context of population genetics: assuming that the
genealogical tree of a sample is a $\Lambda$-coalescent (definitions will be given below), how much
genetic variation do we expect to see? A complete answer exists in the special case where
only pairwise collisions are possible, due to the celebrated Ewens sampling formula p3] for
the Kingman coalescent [21]. More recently, partial results have been obtained by Berestycki
et al. [7] in the particular case of Beta-coalescents (for a general overview of previous results
on the subject, we refer the reader to Durrett [13] or Berestycki [8]).

The main goal of this paper is to address this question in general, by providing a robust
method which translates information about the speed of coming down from infinity of the
genealogical tree into explicit asymptotic formulae (as the sample size increases to $\infty$) for
quantifying the genetic variation. Since the speed of coming down from infinity was recently
analyzed by the authors in [4], this method in combination with results from [3] enables us
to obtain the following results:

(a) In Theorem 3 we obtain in complete generality (i.e., for arbitrary finite measures $\Lambda$
such that the corresponding $\Lambda$-coalescent comes down from infinity) a deterministic
asymptotic rate of growth for the number of distinct alleles in a sample and the number of
segregating sites (the famous SNP count, or single nucleotide polymorphism). The formula
involves a certain function $\psi(q)$, which is the Laplace exponent of the Lévy process whose
Lévy measure is precisely $x^{-2}\Lambda(dx)$. This convergence holds in probability in general, and
it is strengthened to almost sure convergence provided that the measure $\Lambda$ satisfies an
additional regular variation condition in the neighborhood of zero (precise assumptions will
be given below).

(b) In Theorem 6 we derive explicit almost sure asymptotic formulae for the frequency
spectrum, in both the infinite sites and the infinite alleles models, in the case where $\Lambda$ is
regularly varying near zero.

This last result is a significant improvement and a generalization of previous work of
Berestycki et al. [6, 7] and of Schweinsberg [31], both in the sense that the result is valid for
more general measures $\Lambda$, and in the sense that the convergence holds almost surely rather
than in probability. Our methodology is completely different from that of [7], which relied on
an embedding into stable continuous random trees (CRT), allowing for explicit computations.
As explained above, the argument here is based on the recent work by the authors on the
speed of coming down from infinity [4], and a novel general method which translates such
results into results about sampling formulae. This method is more robust than previous
approaches to this problem, which explains why the results here are both stronger and more
general.

We note that the asymptotics in probability for the number of distinct alleles in a sample,
under the model where the genealogy is driven by the general (regular) $\Xi$-coalescent dynamics,
were obtained in parallel by Limic [23] using an adaptation of the martingale method that led
to the results in [3] and [22] (see Remark 11).

In the sequel, we denote by $\Rightarrow$ the convergence in distribution, and by $\overset{d}{=}$ the equivalence
in distribution. We also use the standard Bachmann-Landau notation $\sim$, $O(\cdot)$, $o(\cdot)$, $\asymp$ for
comparing asymptotic behavior of deterministic and stochastic functions and sequences.

1.1 Mutation models 

We now describe the underlying framework for the sampling results in more detail. Consider
a sample of $n$ individuals, where $n$ is fixed for the moment and will later tend to infinity.
Assume that the genealogical relationship between these individuals is given by a
$\Lambda$-coalescent, where $\Lambda$ is an arbitrary finite measure on $[0,1]$. That is, the genealogical
tree is a Markov process $(\Pi^n(t), t \ge 0)$ on the space of partitions of $\{1,\ldots,n\}$ with the
following transition rates: whenever $\Pi^n$ has $b$ blocks, any $k$-tuple of them merges at rate
$\lambda_{b,k} := \int_{[0,1]} x^{k-2}(1-x)^{b-k}\,\Lambda(dx)$.
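
As an illustration of these transition rates, here is a minimal sketch that evaluates the rate $\lambda_{b,k} = \int_{[0,1]} x^{k-2}(1-x)^{b-k}\,\Lambda(dx)$ at which a fixed $k$-tuple among $b$ blocks merges. It assumes the special case $\Lambda = \mathrm{Beta}(2-\alpha,\alpha)$ with $1 < \alpha < 2$, for which the integral reduces to a ratio of Beta functions. This choice of measure and the names `log_beta` and `beta_coalescent_rate` are assumptions made for the example only.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Euler Beta function B(a, b), for a, b > 0."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_coalescent_rate(b: int, k: int, alpha: float) -> float:
    """Rate at which a fixed k-tuple among b blocks merges.

    Assumes Lambda = Beta(2 - alpha, alpha) with 1 < alpha < 2 (an
    illustrative choice), so that
    lambda_{b,k} = B(k - alpha, b - k + alpha) / B(2 - alpha, alpha).
    """
    if not 1.0 < alpha < 2.0:
        raise ValueError("alpha must lie in (1, 2)")
    if not 2 <= k <= b:
        raise ValueError("need 2 <= k <= b")
    return exp(log_beta(k - alpha, b - k + alpha) - log_beta(2.0 - alpha, alpha))
```

A quick sanity check is the consistency relation $\lambda_{b,k} = \lambda_{b+1,k} + \lambda_{b+1,k+1}$, which any family of rates of this integral form must satisfy.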

In order to discuss genetic variation, we need to specify a mutation model. The two most 
widely used and tractable models are the infinite sites model and the infinite alleles model. To 
familiarize oneself with these models, it is also useful to think in terms of the forward-in-time 
evolution dynamics for the whole population (and not only in terms of the backward-in-time 
coalescent dynamics). 

In the classical infinite sites model, introduced by Kimura [19] in 1969, any individual
is affected by neutral mutations at constant rate $\theta > 0$. Here it is also assumed that the
number of loci (the size of the genome) is large, so that each mutation occurs at a new locus.
In particular, if an individual is affected by a mutation, then all the descendants of this
individual carry this mutation (see Figure 1). Conversely, the genetic type of any individual
in the sample depends on the entire history of its ancestral lineage. We denote by $S_n$ the
number of segregating sites, or the total number of distinct genetic types in the sense just
described. This is the same as the omnipresent SNP count (single nucleotide polymorphism)
from the biological literature.

The infinite alleles model is similar but with one difference. As above, any individual is
affected by a mutation at constant rate $\theta$. However, it is now assumed that every mutation
changes the allelic type of the individual into something new, distinct from anything else
(already seen or yet unseen) in the population. Thus, the allelic type of an individual in
the sample is entirely determined by the most recent mutation affecting the corresponding
ancestral lineage. The allelic partition is the partition of the sample (represented by a partition
of $\{1, \ldots, n\}$) obtained by grouping together the individuals that carry the same allelic type.
Denote by $A_n$ the number of blocks in this partition, or equivalently, the total number of
allelic types expressed in the sample.

Remark 1. The two models differ only by the amount of information that is assumed to be 
available in the sample. In the infinite sites model the assumption is that the precise allelic 
type (e.g., the entire DNA sequence) is known for each individual in the sample. On the other 
hand, in the infinite alleles model, the only available information is whether two individuals 
carry the same type or not. Hence for different types we do not know how they differ. 

Thus the infinite alleles model contains less information than the infinite sites model, 
and is more appropriate in practice for situations where the only available information is, 
for example, based on observed physiological differences. On the other hand the infinite 
sites model is more natural when the full genetic information (the DNA sequences of each 
individual in the sample) is available. 
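
To make the difference between the two counts concrete, the following sketch computes the number of segregating sites and the number of allelic types on a small hand-made tree. The tree encoding (a `parent` dictionary and a `mutations` dictionary keyed by the lower endpoint of each edge), the function name and the toy example are all hypothetical; the sketch also assumes that every marked edge belongs to the genealogy of the sampled leaves.

```python
def count_sites_and_alleles(parent, mutations, leaves):
    """Return (number of segregating sites, number of allelic types).

    parent    -- dict mapping each node to its parent (the root maps to None)
    mutations -- dict mapping a node to the list of mutation labels on the
                 edge joining it to its parent, most recent mutation first
    leaves    -- iterable of the sampled leaves
    """
    n_sites = sum(len(marks) for marks in mutations.values())
    if n_sites == 0:
        return 0, 0  # convention adopted in the text when no mutation is present

    allelic_types = set()
    for leaf in leaves:
        node, allele = leaf, "ancestral"
        while node is not None:
            marks = mutations.get(node, [])
            if marks:
                allele = marks[0]  # most recent mutation on the ancestral lineage
                break
            node = parent[node]
        allelic_types.add(allele)
    return n_sites, len(allelic_types)


# Hypothetical tree: leaves 1 and 2 merge at node "a"; then "a" and leaf 3 merge at "root".
parent = {1: "a", 2: "a", 3: "root", "a": "root", "root": None}
mutations = {1: ["m1"], 2: ["m3"], "a": ["m2"]}
sites, alleles = count_sites_and_alleles(parent, mutations, leaves=[1, 2, 3])
```

In this toy example the mutation `m2` is counted as a segregating site but does not create an allelic type of its own, because both lineages below it carry a more recent mutation.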

The above random variables can be realized in a natural way on a common probability
space as follows. Consider a $\Lambda$-coalescent $(\Pi_t, t \ge 0)$ that comes down from infinity, and let
$\mathcal{T}$ be the associated coalescent tree. Then $\mathcal{T}$ is a tree with infinitely many leaves $1, 2, \ldots$,
and the root given by the most recent common ancestor among all the individuals. Each
branch of $\mathcal{T}$ is endowed with a positive number, its length, which is the size of the interval of
time that elapsed between the two defining coalescent events for this branch (the one that
started and the one that ended it). Let $\mathcal{P}$ be a Poisson process of mutations on the branches
of $\mathcal{T}$, where the intensity of mutations is constant and equal to $\theta$ per unit length. Restricting
$\mathcal{T}$ to the first $n$ leaves produces a finite tree (even if the $\Lambda$-coalescent does not come down
from infinity), denoted by $\mathcal{T}_n$, that has the law generated by the same $\Lambda$-coalescent started
from $n$ particles. The restriction of $\mathcal{P}$ to $\mathcal{T}_n$ is identified as the mutation process on $\mathcal{T}_n$, and
it is a sufficient statistic for $S_n$ and $A_n$. It is useful to note here that $\mathcal{T}$ and $\mathcal{P}$ alone
determine, simultaneously for all $n$, the values of $A_n$ and $S_n$, as well as various related
quantities to be introduced in the sequel. Moreover, the coupling induced by this procedure
between $\mathcal{T}_n$ and $\mathcal{T}_m$ for $m > n$ is canonical from the sampling perspective, in that the
mutations that arrive onto $\mathcal{T}_n$ also arrive onto $\mathcal{T}_m$. On the asymptotically unlikely event
$\{\mathcal{P} \cap \mathcal{T}_n = \emptyset\}$, we declare $A_n = S_n = 0$.

[Figure 1: coalescent tree with leaves labeled 1, 4, 3, 2, 5, 6 from left to right.]

Figure 1: The genealogical tree $\mathcal{T}_n$ for a sample of size $n = 6$. The mutations on the (vertical)
branches of $\mathcal{T}_n$ are indicated as dots. The dot encircled in black corresponds to the mutation
which is not seen under the infinite alleles model. The dot encircled in gray will be referred to
in Section 2. Thus we have $S_n = 5$ while $A_n = 4$.

1.2 Sampling formulae 

Let $\Lambda$ be a finite measure on $[0, 1]$. We will assume without further mention that $\Lambda(\{1\}) = 0$
(for reasons why this can be done without loss of full generality see any of [H [26} I29j).

Definition 2. We say that $\Lambda$ has (strong) $\alpha$-regular variation at zero if $\Lambda(dx) = f(x)\,dx$
where $f(x) \sim A x^{1-\alpha}$ as $x \to 0$ for some $1 < \alpha < 2$ and $A > 0$.

For any given finite measure $\Lambda$ on $[0, 1]$, associate a function $\psi_\Lambda = \psi$ defined by

$$\psi(q) := \int_{[0,1]} \left( e^{-qx} - 1 + qx \right) x^{-2}\, \Lambda(dx). \tag{1}$$

The function $\psi$ is the Laplace exponent of a Lévy process, which is intimately connected with
the behavior of the $\Lambda$-coalescent. These links are discussed in a companion paper [5].
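
The Laplace exponent in (1) is easy to evaluate numerically in simple cases. The sketch below does so under the assumption that $\Lambda(dx) = A x^{1-\alpha}\,dx$ on $[0,1]$ exactly (a pure power-law density, chosen only for illustration), and compares the result with the power law $A\Gamma(2-\alpha)\,q^{\alpha}/(\alpha(\alpha-1))$ that governs $\psi$ for large $q$ in this case. The function name `psi_power_law`, the substitution $x = u^{1/(2-\alpha)}$ and the midpoint rule are choices made for this example.

```python
from math import expm1, gamma


def psi_power_law(q: float, alpha: float, amplitude: float, steps: int = 100_000) -> float:
    """Approximate psi(q) for Lambda(dx) = amplitude * x**(1 - alpha) dx on [0, 1].

    The substitution x = u**p with p = 1 / (2 - alpha) turns x**(1 - alpha) dx
    into p du, which removes the integrable singularity at 0; the remaining
    integrand is handled by the midpoint rule on [0, 1].
    """
    p = 1.0 / (2.0 - alpha)
    total = 0.0
    for i in range(steps):
        x = ((i + 0.5) / steps) ** p
        y = q * x
        if y < 1e-4:
            # Taylor expansion of (e^{-y} - 1 + y) / x^2, avoiding cancellation
            ratio = q * q * (0.5 - y / 6.0)
        else:
            ratio = (expm1(-y) + y) / (x * x)
        total += ratio
    return amplitude * p * total / steps


alpha, amplitude, q = 1.5, 0.5, 1000.0
power_law = amplitude * gamma(2.0 - alpha) / (alpha * (alpha - 1.0)) * q ** alpha
ratio = psi_power_law(q, alpha, amplitude) / power_law
```

For these parameter values the ratio is close to, and slightly below, 1; the gap shrinks as $q$ grows.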
In this paper we will usually require that

$$\int_1^\infty \frac{dq}{\psi(q)} < \infty, \tag{2}$$

which is known as Grey's condition. As was proved by Bertoin and Le Gall [10] (see also [5]
for a probabilistic proof), this is equivalent to the requirement that the $\Lambda$-coalescent comes
down from infinity. One can check (see e.g. [15, XIII.6]) that in the case of strong $\alpha$-regular
variation,

$$\psi(q) \sim \frac{A\Gamma(2-\alpha)}{\alpha(\alpha-1)}\, q^{\alpha}, \quad \text{as } q \to \infty, \tag{3}$$

where $A$ is the constant from Definition 2, so in particular the Grey condition (2) holds if
$\alpha \in (1,2)$. Our first result concerns the asymptotic behavior of the number $S_n$ of segregating
sites and the size $A_n$ of the allelic partition.

Theorem 3. Assume (2) and let $X_n$ denote either $A_n$ or $S_n$. Then

$$\frac{X_n}{\int_1^n q\,\psi(q)^{-1}\,dq} \longrightarrow \theta, \tag{4}$$

in probability as $n \to \infty$. Moreover, if $\Lambda$ has (strong) $\alpha$-regular variation at zero, then the
above convergence holds almost surely, implying

$$n^{\alpha-2} X_n \longrightarrow \theta B, \quad \text{almost surely}, \tag{5}$$

where $B = B(A,\alpha) := \alpha(\alpha - 1)/[A\Gamma(2 - \alpha)(2 - \alpha)]$.
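
As a small numerical companion to Theorem 3, the sketch below evaluates the constant $B(A,\alpha) = \alpha(\alpha-1)/[A\Gamma(2-\alpha)(2-\alpha)]$ and the resulting first-order prediction $\theta B n^{2-\alpha}$. The helper `beta_coalescent_amplitude` assumes the Beta$(2-\alpha,\alpha)$ case, where the density behaves like $x^{1-\alpha}/(\Gamma(2-\alpha)\Gamma(\alpha))$ near zero; all function names are illustrative.

```python
from math import gamma


def constant_B(amplitude: float, alpha: float) -> float:
    """B(A, alpha) = alpha (alpha - 1) / [A Gamma(2 - alpha) (2 - alpha)]."""
    if not 1.0 < alpha < 2.0 or amplitude <= 0.0:
        raise ValueError("need 1 < alpha < 2 and A > 0")
    return alpha * (alpha - 1.0) / (amplitude * gamma(2.0 - alpha) * (2.0 - alpha))


def predicted_count(n: int, theta: float, amplitude: float, alpha: float) -> float:
    """First-order prediction theta * B * n**(2 - alpha) for S_n or A_n."""
    return theta * constant_B(amplitude, alpha) * n ** (2.0 - alpha)


def beta_coalescent_amplitude(alpha: float) -> float:
    """A = 1 / (Gamma(2 - alpha) Gamma(alpha)), assuming Lambda = Beta(2 - alpha, alpha)."""
    return 1.0 / (gamma(2.0 - alpha) * gamma(alpha))
```

In the Beta-coalescent case the constant simplifies to $B = \alpha(\alpha-1)\Gamma(\alpha)/(2-\alpha)$. Since the statement is asymptotic, `predicted_count` should be read as a leading-order approximation for large $n$, not as an exact expectation.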

Conjecture 4. Though our proofs rely on the fact that the $\Lambda$-coalescent comes down from
infinity, we observe that the above statement does not appeal to the function $v(t)$ and hence
could hold in general. Due to the result of Basdevant and Goldschmidt [3] (see also Remark
8 in [22] for a short and more robust argument) on the number of allelic families in the
Bolthausen-Sznitman coalescent, it is easy to check that (4) holds in this special case, whereas
(2) does not hold. We conjecture that (4) holds for all the $\Lambda$-coalescents, even in the sense of
$L^1$ convergence.

Remark 5. Let $\tilde\psi(q) := \int_{[0,1]} \left((1 - x)^q - 1 + qx\right) x^{-2}\, \Lambda(dx)$ be a close relative of $\psi$ (note that
$|\tilde\psi(q) - \psi(q)| = O(q)$ and moreover that $\tilde\psi(q) \sim \psi(q)$ as $q \to \infty$). Applying the optional
stopping formula to the following submartingale

$$\left( \int_{N^{\Lambda,n}(t)}^{n} \frac{q}{\tilde\psi(q)}\, dq - \int_0^t N^{\Lambda,n}(u)\, du, \ t \ge 0 \right)$$

gives an expectation upper bound in support of the above conjecture. We refer the reader to
(11) for the rest of the notation, and to the above mentioned remark in [22] as well as the
argument leading to (25) in [24] for applications of similar processes in the study of coming
down from infinity.

We also obtain precise results for the full frequency spectrum in both infinite sites and
infinite alleles models. For each $n$ consider a sample of size $n$, and for each $k \in \{1, \ldots, n\}$,
let $F_{k,n}$ be the number of families of size $k$ in its allelic partition, and $M_{k,n}$ be the number of
mutations affecting precisely $k$ of its individuals under the infinite sites model.
Theorem 6. Suppose that $\Lambda$ has (strong) $\alpha$-regular variation at zero. Recall the constant
$B = B(A, \alpha)$ from Theorem 3. Let $X_{k,n}$ denote $M_{k,n}$ or $F_{k,n}$, where $1 \le k \le n$. As $n \to \infty$,

$$\frac{X_{k,n}}{n^{2-\alpha}} \longrightarrow \theta B\, \frac{(2 - \alpha)\,\Gamma(k + \alpha - 2)}{k!\,\Gamma(\alpha - 1)} \quad \text{a.s.} \tag{6}$$

Moreover, if $P_1, P_2, \ldots$ are the ordered allele frequencies in the population, then

$$P_j \sim C\, j^{-1/(2-\alpha)} \tag{7}$$

almost surely as $j \to \infty$, where $C = (\theta B/\Gamma(\alpha - 1))^{1/(2-\alpha)}$.

By the properties of the Gamma function, another expression for the constant on the
right-hand side of (6) is

$$\theta B (2 - \alpha)\, \frac{(\alpha - 1) \cdots (\alpha + k - 3)}{k!}.$$

As mentioned in the Introduction, the above results are improvements over previously known
results, since the convergence (6) was known to hold only in probability in the case of
Beta-coalescents (see [7]), while (7) was not known to hold even in this special case.
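
The $k$-dependent factor in the frequency spectrum, namely $(2-\alpha)\Gamma(k+\alpha-2)/(k!\,\Gamma(\alpha-1))$, can be tabulated directly. The sketch below computes it both through log-Gamma functions and through the product form $(2-\alpha)(\alpha-1)\cdots(\alpha+k-3)/k!$; the function names are illustrative.

```python
from math import exp, lgamma


def spectrum_weight(k: int, alpha: float) -> float:
    """(2 - alpha) Gamma(k + alpha - 2) / (k! Gamma(alpha - 1)), via log-Gamma."""
    if k < 1 or not 1.0 < alpha < 2.0:
        raise ValueError("need k >= 1 and 1 < alpha < 2")
    log_ratio = lgamma(k + alpha - 2.0) - lgamma(k + 1.0) - lgamma(alpha - 1.0)
    return (2.0 - alpha) * exp(log_ratio)


def spectrum_weight_product(k: int, alpha: float) -> float:
    """The same constant written as (2 - alpha)(alpha - 1)...(alpha + k - 3) / k!."""
    if k < 1 or not 1.0 < alpha < 2.0:
        raise ValueError("need k >= 1 and 1 < alpha < 2")
    value = 2.0 - alpha  # the k = 1 term
    for j in range(2, k + 1):
        value *= (alpha + j - 3.0) / j
    return value


weights = [spectrum_weight(k, 1.5) for k in range(1, 51)]
```

These weights are positive and sum to 1 over $k \ge 1$, which agrees with the first-order growth $\theta B n^{2-\alpha}$ of the total count in Theorem 3. They decay only polynomially in $k$, so partial sums approach 1 slowly.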

One key ingredient for our arguments, beside our earlier work on the speed of coming
down from infinity [4], is the asymptotic study of the time at which a randomly chosen
(uniformly) mutation was generated. More precisely, let the time run in the "coalescent
direction" (backward from the point of view of the population dynamics), so that the leaves
of the tree are present at time 0, and the number of branches decreases in time. Denote by $M_n$
the time-coordinate (age) of a point chosen at random from $\mathcal{P} \cap \mathcal{T}_n$. On the (asymptotically
unlikely) event $\{\mathcal{P} \cap \mathcal{T}_n = \emptyset\} = \{S_n = A_n = 0\}$, we set $M_n = 0$ (although this value could be
set to anything between 0 and the time of the MRCA (the root of $\mathcal{T}_n$) and the next result would
still be true).

Define

$$g(n) := \begin{cases} n^{1-\alpha}, & \text{if } 1 < \alpha < 3/2, \\ n^{-1/2}\log n, & \text{if } \alpha = 3/2, \\ n^{\alpha-2}, & \text{if } 3/2 < \alpha < 2. \end{cases} \tag{8}$$

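Theorem 7. Suppose that $\Lambda$ has (strong) $\alpha$-regular variation at zero, for some $\alpha \in (1,2)$.

(a) We have

$$\frac{M_n}{n^{1-\alpha}} \Rightarrow \frac{\alpha}{A\Gamma(2 - \alpha)} \left( U^{-(\alpha-1)/(2-\alpha)} - 1 \right), \tag{9}$$

where $U = \mathrm{Unif}[0, 1]$.

(b) If in addition $\Lambda[1 - \eta, 1] = 0$ for some $\eta > 0$, then there exists $c_1 = c_1(\alpha) \in (0, \infty)$ such
that, for $g$ given by (8),

$$\lim_{n\to\infty} \frac{E(M_n)}{g(n)} = c_1. \tag{10}$$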

Remark 8. Note that $M_n$ and $E(M_n)$ are only of the same order of magnitude if $\alpha < 3/2$.
Naturally, this is because the limit variable is integrable if and only if $\alpha < 3/2$.

Interestingly, $g(n)$ observed as a function of $\alpha$ decreases on $(1,3/2)$ and increases on
$(3/2,2)$, and moreover has a discontinuity on both sides at $\alpha = 3/2$. It seems difficult to see
intuitively why this happens.
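
The three regimes can be made explicit with a few lines of code. The sketch below implements the normalization $g(n)$, equal to $n^{1-\alpha}$, $n^{-1/2}\log n$ or $n^{\alpha-2}$ according to whether $\alpha$ is below, at or above $3/2$, together with the mean of the limit variable $\frac{\alpha}{A\Gamma(2-\alpha)}\,(U^{-(\alpha-1)/(2-\alpha)} - 1)$ with $U$ uniform on $[0,1]$. That mean follows from $E[U^{-\gamma}] = 1/(1-\gamma)$ for $\gamma < 1$ and is infinite otherwise. Function names are illustrative.

```python
from math import gamma, log


def g(n: int, alpha: float) -> float:
    """Normalization for E(M_n) in the three regimes of alpha."""
    if not 1.0 < alpha < 2.0:
        raise ValueError("alpha must lie in (1, 2)")
    if alpha < 1.5:
        return n ** (1.0 - alpha)
    if alpha == 1.5:
        return n ** -0.5 * log(n)
    return n ** (alpha - 2.0)


def limit_mean(amplitude: float, alpha: float) -> float:
    """Mean of c * (U**(-(alpha - 1)/(2 - alpha)) - 1), with c = alpha / (A Gamma(2 - alpha)).

    The mean is finite only when (alpha - 1)/(2 - alpha) < 1, that is alpha < 3/2.
    """
    if not 1.0 < alpha < 2.0:
        raise ValueError("alpha must lie in (1, 2)")
    if alpha >= 1.5:
        return float("inf")
    c = alpha / (amplitude * gamma(2.0 - alpha))
    return c * (alpha - 1.0) / (3.0 - 2.0 * alpha)
```

The factor $(\alpha-1)/(3-2\alpha)$ blows up as $\alpha$ increases to $3/2$, which is one way to see the change of regime numerically.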
Remark 9. We believe that the result (10) should hold without any further restriction on
$\Lambda$ than strong regular variation. The techniques used in the proof of (10) can be used to
show, with some additional effort, that the sequence $E(M_n)/g(n)$ is bounded away from 0
and infinity when no assumption is made on the support of $\Lambda$. However, in the interest of
brevity we decided to omit these arguments.

The link between Theorem 3 and Theorem 6 is provided by a remarkable Tauberian
theorem for random partitions of Gnedin, Hansen and Pitman [16]. The assumptions of this
theorem were recently extended in an independent but related work of Schweinsberg [31] to
deal with convergence in probability (to which the approach of [16] could not apply). This
allowed him to obtain the convergence in probability of Theorem 6 for the limiting behavior
of $F_{k,n}$ (though not for that of $M_{k,n}$). Gnedin, Hansen and Pitman [16] also derive a central
limit theorem for $F_{k,n}$. It is natural to ask whether this result can be extended to our setting
with random frequencies. Kersting [18] has recently obtained precise fluctuation results for
the length of the genealogical tree in the regularly varying case. In particular, it follows from
his Theorem 1 that these fluctuations are not Gaussian. In order to resolve the open problem
just mentioned, one would need to analyze the complex interplay between the fluctuations of
the tree length and the Poisson fluctuations of the mutations.

Organisation of the rest of the paper. Section 2 is devoted to proving the results
on the mutation frequency spectrum, announced in Section 1. More precisely, we prove
Theorem 7 in Section 2.1, Theorem 3 in Section 2.2 and Theorem 6 in Section 2.3. The final
section relaxes the technical condition on the support of $\Lambda$ needed in the proofs of Theorems
3 and 6.



2 Proofs of the results 

Fix some $\theta > 0$. For each $n \in \mathbb{N}$ and $t \ge 0$, let $N^{\Lambda,n}(t)$ denote the number of ancestral
lineages of the first $n$ individuals remaining at time $t$. In particular, $(N^{\Lambda,n}(t), t \ge 0)$ is a
continuous-time Markov jump process, starting from $N^{\Lambda,n}(0) = n$. We assume throughout
this section that (2) holds, or equivalently, that the $\Lambda$-coalescent comes down from infinity:

$$N^{\Lambda}(t) := \lim_{n\to\infty} N^{\Lambda,n}(t) = \sup_{n \ge 1} N^{\Lambda,n}(t) < \infty, \quad \forall t > 0. \tag{11}$$

We will need some further notation. Define for $k, n \in \mathbb{N}$,

$$\tau_k^n = \inf\{t > 0 : N^{\Lambda,n}(t) \le k\}, \quad \text{and} \quad \tau_k = \lim_{n\to\infty} \tau_k^n = \inf\{t > 0 : N^{\Lambda}(t) \le k\}. \tag{12}$$

In particular $\tau_1^n = \inf\{t > 0 : N^{\Lambda,n}(t) = 1\}$ is the time of the MRCA for the sample containing
the first $n$ individuals, and $L_n = \int_0^{\tau_1^n} N^{\Lambda,n}(t)\, dt$ is the total length of the tree $\mathcal{T}_n$.

The genealogical tree $\mathcal{T}$ is a path-connected set in $\mathbb{R}^2$. To any point $x$ on the tree one can
associate a number $t = t(x)$ called the time-coordinate or the age of $x$, which is defined as the
distance from that point to the set of leaves of $\mathcal{T}$. Let $\mathcal{T}^n$ be the subtree of $\mathcal{T}$ consisting of
all the points in $\mathcal{T}$ having age in $[\tau_n, \tau_1]$ (the superscript distinguishes it from the sample
tree $\mathcal{T}_n$). Then $L^n = \int_{\tau_n}^{\tau_1} N^{\Lambda}(u)\,du$ is the length of $\mathcal{T}^n$. The function

$$v(t) := \inf\left\{ s > 0 : \int_s^\infty \frac{dq}{\psi(q)} < t \right\} \tag{13}$$

plays a central role in the analysis of asymptotic behavior of $\mathcal{T}^n$. Finally, define $t_n$ as

$$t_n := \int_n^\infty \frac{dq}{\psi(q)}. \tag{14}$$

The following lemma gathers some asymptotic results which we will use in the rest of the
proof. Set

$$\delta = \frac{2-\alpha}{\alpha - 1} \in (0,\infty). \tag{15}$$

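Lemma 10. Assume (3) and define $c = c(A,\alpha) = \alpha/(A\Gamma(2 - \alpha))$ (compare with the constant
in (9)). Then, as $n \to \infty$, we have almost surely

1. $\tau_n \sim c\, n^{1-\alpha}$, and

2. $t_n \sim c\, n^{1-\alpha}$.

Furthermore, there exist $c_1 = c_1(\Lambda) > 0$, and $c_2 = c_2(A,\alpha) \in (0, \infty)$, such that

3. $P(\tau_1 > x) \le e^{-c_1 x}$, for all $x \ge 1$,

4. $L^n = \int_{\tau_n}^{\tau_1} N^{\Lambda}(u)\, du \sim \int_{t_n}^{1} v(u)\, du \sim \frac{c^{1/(\alpha-1)}}{\delta} (\tau_n)^{-\delta} \sim \frac{c}{\delta}\, n^{2-\alpha}$, a.s. as $n \to \infty$,

5. as $x \to 0$,

$$\int_x^{\tau_1} u N^{\Lambda}(u)\,du \sim \begin{cases} c_2 x^{-\delta+1}, & \delta > 1, \\ c_2 \log(1/x), & \delta = 1, \\ Y, & \delta < 1, \end{cases} \ \text{a.s.} \quad \text{and} \quad \int_x^{1} u\, v(u)\,du \sim \begin{cases} c_2 x^{-\delta+1}, & \delta > 1, \\ c_2 \log(1/x), & \delta = 1, \\ c_2, & \delta < 1, \end{cases} \tag{16}$$

where $c_2 = c_2(\delta)$ and $Y := \int_0^{\tau_1} u N^{\Lambda}(u)\,du$ is a finite random variable if $\delta < 1$.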
Proof. Theorem 1 in [3] and (3) yield

$$N^{\Lambda}(t) \sim v(t) \sim c^{1/(\alpha-1)}\, t^{-1/(\alpha-1)}, \quad \text{as } t \to 0, \text{ almost surely}, \tag{17}$$

where $c = c(A, \alpha)$ is as specified above. The asymptotic behavior (17) implies that $N^{\Lambda}(\tau_n) =
n(1 + o(1))$, almost surely, as $n \to \infty$. Indeed, since $\tau_n \to 0$, we have $N^{\Lambda}(\tau_n) \sim v(\tau_n) =
v(\tau_n-) \sim N^{\Lambda}(\tau_n-)$, and at the same time, $P(N^{\Lambda}(\tau_n) \le n \le N^{\Lambda}(\tau_n-)) = 1$. Since, again due
to (17), $N^{\Lambda}(\tau_n) \sim c^{1/(\alpha-1)} (\tau_n)^{-1/(\alpha-1)}$, we obtain claim 1. Claim 2 is directly seen from (14) and
the asymptotic behavior of $v$ in (17).

Claims 4 and 5 are derived similarly. One notes first that, due to $P(\tau_1 > 0) = 1$ and (17),

$$\int_x^{\tau_1} N^{\Lambda}(u)\, du \sim \int_x^{1} v(u)\, du \sim \frac{c^{1/(\alpha-1)}}{\delta}\, x^{-\delta}, \quad \text{as } x \to 0,$$

and then uses the facts that $\tau_n \to 0$, $t_n \to 0$ as well as claims 1 and 2 to obtain claim 4. For
claim 5, we use $u N^{\Lambda}(u) \sim c^{1/(\alpha-1)} u^{-1/(\alpha-1)+1} = c^{1/(\alpha-1)} u^{-\delta}$, and this uniquely determines $c_2$. If
$\delta < 1$, then both $\int_0^1 u\, v(u)\, du$ and $\int_0^{\tau_1} u N^{\Lambda}(u)\, du$ are finite.

It remains to verify claim 3. Due to monotonicity of the coalescent and the simple Markov
property, we have $P(\tau_1^n > m + 1 \mid \tau_1^n > m) \le P(\tau_1^n > 1)$. In turn, letting $n \to \infty$,

$$P(\tau_1 > m + 1 \mid \tau_1 > m) \le P(\tau_1 > 1), \quad \text{for each } m \ge 1.$$

Hence, by induction, $P(\tau_1 > m) \le e^{-c_1 m}$ for all $m \ge 0$, with $c_1 = -\log P(\tau_1 > 1)$. □
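
The bookkeeping of constants in Lemma 10 can be checked numerically. The following sketch (with illustrative function names) computes $c = \alpha/(A\Gamma(2-\alpha))$ and $\delta = (2-\alpha)/(\alpha-1)$, and confirms that substituting $x = c\,n^{1-\alpha}$ into $\frac{c^{1/(\alpha-1)}}{\delta}x^{-\delta}$ gives $\frac{c}{\delta}n^{2-\alpha}$, where $c/\delta$ coincides with the constant $B$ of Theorem 3.

```python
from math import gamma


def lemma_constants(amplitude: float, alpha: float) -> tuple[float, float]:
    """Return (c, delta) with c = alpha / (A Gamma(2 - alpha)), delta = (2 - alpha)/(alpha - 1)."""
    if not 1.0 < alpha < 2.0 or amplitude <= 0.0:
        raise ValueError("need 1 < alpha < 2 and A > 0")
    c = alpha / (amplitude * gamma(2.0 - alpha))
    delta = (2.0 - alpha) / (alpha - 1.0)
    return c, delta


def tree_length_first_order(n: int, amplitude: float, alpha: float) -> float:
    """Leading-order tree length (c / delta) * n**(2 - alpha)."""
    c, delta = lemma_constants(amplitude, alpha)
    return (c / delta) * n ** (2.0 - alpha)


def tree_length_via_small_time(n: int, amplitude: float, alpha: float) -> float:
    """Same quantity obtained by plugging x = c n**(1 - alpha) into c**(1/(alpha-1)) x**(-delta) / delta."""
    c, delta = lemma_constants(amplitude, alpha)
    x = c * n ** (1.0 - alpha)
    return c ** (1.0 / (alpha - 1.0)) / delta * x ** (-delta)
```

The two functions agree up to floating-point error because $1/(\alpha-1) - \delta = 1$ and $(\alpha-1)\delta = 2-\alpha$.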
2.1 Proof of Theorem 7

Our first goal will be to prove Theorem 7. Let $\mathcal{A}^n = \{S_n \ge 1\} = \{A_n \ge 1\}$. Since the length
$L_n$ of $\mathcal{T}_n$ diverges, we have that $P(\mathcal{A}^n) \to 1$, as $n \to \infty$. Recall that on $\mathcal{A}^n$, $M_n$ is the age of
a randomly chosen mutation in $\mathcal{T}_n$, and that on the complement of $\mathcal{A}^n$, $M_n$ is set to 0. Due
to basic properties of Poisson point processes, on the event $\mathcal{A}^n$ (of overwhelming probability),
the random mutation is positioned as a point $P^*$ chosen uniformly at random from $\mathcal{T}_n$. In
symbols,

$$M_n = M_n^* \cdot 1_{\mathcal{A}^n} + 0 \cdot 1_{(\mathcal{A}^n)^c}, \tag{18}$$

where $M_n^*$ is the time-coordinate of $P^*$, and $P^*$ is independent of $\mathcal{A}^n$. Due to this
independence, and the fact $P(\mathcal{A}^n) \to 1$ as $n \to \infty$, we have that $E[M_n^* \cdot 1_{(\mathcal{A}^n)^c}] = E(M_n^*)P((\mathcal{A}^n)^c) =
o(E(M_n^*))$, and therefore $E(M_n) \sim E(M_n^*)$, as $n \to \infty$. Similarly

$$P(M_n \cdot n^{\alpha-1} \le x) = P(M_n^* \cdot n^{\alpha-1} \le x) + O(P((\mathcal{A}^n)^c)), \quad \forall x > 0,$$

therefore (9) is equivalent to

$$\frac{M_n^*}{n^{1-\alpha}} \Rightarrow c\left(U^{-(\alpha-1)/(2-\alpha)} - 1\right), \quad \text{where } U = \mathrm{Unif}[0,1]. \tag{19}$$

Hence we proceed by studying $M_n^*$.

Now recall the subtree $\mathcal{T}^n$ of $\mathcal{T}$. Consider a uniform random point on $\mathcal{T}^n$ and let $M^n$ be
its age minus $\tau_n$. Let $\mathcal{E}_n = \{N^{\Lambda}(\tau_n) = n\}$ be the event that $\Pi$ ever attains a configuration with
exactly $n$ blocks. Due to the consistency property and the Markov property of $(\Pi_t, t \ge 0)$,

the conditional law of $\mathcal{T}^n$ given $\mathcal{F}_{\tau_n}$ on the event $\mathcal{E}_n$ equals the law of $\mathcal{T}_n$. (20)

This clearly induces the equivalence of the conditional law of $M^n$ given $\mathcal{F}_{\tau_n}$ on $\mathcal{E}_n$ and the
law of $M_n^*$, which will be used below. Recalling (12), we have

$$P(M^n > x \mid \mathcal{T}^n) = \frac{\int_{\tau_n + x}^{\tau_1} N^{\Lambda}(u)\, du}{L^n}, \quad x \in [0, \tau_1 - \tau_n].$$

So, recalling $\delta$ from (15), we have for a fixed $y \in (0, 1)$

$$P\left((M^n/\tau_n + 1)^{-\delta} \le y\right) = E\left[ \frac{\int_{\tau_n y^{-1/\delta} \wedge \tau_1}^{\tau_1} N^{\Lambda}(u)\, du}{L^n} \right]. \tag{21}$$

By the argument used to show claim 4 of Lemma 10 and $P(\tau_1 > 0) = 1$, we conclude

$$\int_{\tau_n y^{-1/\delta} \wedge \tau_1}^{\tau_1} N^{\Lambda}(u)\, du \sim c' \left(\tau_n y^{-1/\delta}\right)^{-1/(\alpha-1)+1} = c'\, \tau_n^{-\delta}\, y, \quad \text{a.s.}$$

as $n \to \infty$, where $c' = c^{1/(\alpha-1)}/\delta$. Due to claim 4 of Lemma 10 we now see that the random
variable inside the expectation on the RHS of (21) converges to $y$, almost surely. Since

$$\left| E\left[ P\left((M^n/\tau_n + 1)^{-\delta} \le y \mid \mathcal{F}_{\tau_n}\right) 1_{\mathcal{E}_n} \right] - y\, P(\mathcal{E}_n) \right| \le E\left[ \left| \frac{\int_{\tau_n y^{-1/\delta} \wedge \tau_1}^{\tau_1} N^{\Lambda}(u)\, du}{L^n} - y \right| 1_{\mathcal{E}_n} \right]$$

and since the bounded (by 2) random variable inside the expectation converges to 0 almost
surely, the dominated convergence theorem implies that the left-hand side converges
to 0. Recalling (20), or its consequence for random points, this quantity can be rewritten
as $\left| P\left((M_n^*/\tau_n + 1)^{-\delta} \le y\right) - y \right| P(\mathcal{E}_n)$. Theorem 1.8 in [6] gives $P(\mathcal{E}_n) \to \alpha - 1$ as $n \to \infty$,
hence $P\left((M_n^*/\tau_n + 1)^{-\delta} \le y\right) \to y$, for all $y \in [0, 1]$. This, together with claim 1 of Lemma 10,
implies (19).

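The proof of (10) is analogous but technically more delicate. Due to $P(M_n^* > x \mid \mathcal{T}_n) =
\int_x^{\tau_1^n} N^{\Lambda,n}(u)\, du / L_n$ and Fubini's theorem, we have

$$E(M_n^* \mid \mathcal{T}_n) = \frac{\int_0^\infty \int_x^{\tau_1^n} N^{\Lambda,n}(u)\,du\,dx}{L_n} = \frac{\int_0^{\tau_1^n} u N^{\Lambda,n}(u)\,du}{\int_0^{\tau_1^n} N^{\Lambda,n}(u)\,du}. \tag{22}$$

Therefore, $E(M_n^*) = E[Y_n]$, where

$$Y_n = \frac{\int_0^{\tau_1^n} u N^{\Lambda,n}(u)\, du}{\int_0^{\tau_1^n} N^{\Lambda,n}(u)\, du}. \tag{23}$$

Due to (20), the variable $Y_n$ is equal in law to

$$Y^n = \frac{\int_{\tau_n}^{\tau_1} (u - \tau_n) N^{\Lambda}(u)\, du}{\int_{\tau_n}^{\tau_1} N^{\Lambda}(u)\,du} = \frac{\int_{\tau_n}^{\tau_1} u N^{\Lambda}(u)\, du}{\int_{\tau_n}^{\tau_1} N^{\Lambda}(u)\,du} - \tau_n, \quad \text{given } \mathcal{F}_{\tau_n}, \text{ on the event } \mathcal{E}_n. \tag{24}$$

Using Lemma 10 one can analyze the asymptotic behavior of $Y^n/g(n)$ in each of the three cases
$\delta > 1$, $\delta < 1$ and $\delta = 1$ (corresponding respectively to $\alpha < 3/2$, $\alpha > 3/2$ and $\alpha = 3/2$). First
observe that $\tau_n/g(n) \to 0$ almost surely if $\alpha \ge 3/2$, and otherwise $\tau_n/g(n) \to c$ almost surely,
where $c$ is the constant from claim 1 of Lemma 10. One can apply claims 4 and 5 (plugging
in $\tau_n$ as $x$) of Lemma 10 to the first term in (24). More precisely, if we let $c_3 = c_2\delta/c^{\delta}$ and
$c_4 = \delta/c$, then

1. if $\delta > 1$ then $\int_{\tau_n}^{\tau_1} u N^{\Lambda}(u)\, du \sim c_2 \tau_n^{-\delta+1}$ so $Y^n/g(n) \to c_3 - c$, almost surely.

2. if $\delta < 1$ then $\int_{\tau_n}^{\tau_1} u N^{\Lambda}(u)\, du \to Y$ so $Y^n/g(n) \to c_4 Y$, almost surely.

3. if $\delta = 1$ then $\int_{\tau_n}^{\tau_1} u N^{\Lambda}(u)\, du \sim -c_2 \log \tau_n \sim c_2(\alpha - 1)\log n$ so $Y^n/g(n) \to (\alpha - 1)c_2/c$,
almost surely.

With a slight abuse of notation, let $Y := \lim_{n\to\infty} Y^n/g(n)$, almost surely. Clearly $P(Y > 0) =
1$. Denote by $\mathcal{D}_C(Y)$ the set of points of continuity for the distribution function of $Y$. Then
for any $x > 0$ in $\mathcal{D}_C(Y)$ and any sequence $(B_n)_{n \ge 1}$ of events, where $B_n \in \mathcal{F}_{\tau_n}$, $n \ge 1$, we have

$$E\left( 1_{B_n}\, E\left( \left| 1_{\{Y^n/g(n) \le x\}} - 1_{\{Y \le x\}} \right| \,\middle|\, \mathcal{F}_{\tau_n} \right) \right) \to 0 \quad \text{as } n \to \infty. \tag{25}$$

We claim that

$$\frac{Y_n}{g(n)} \Rightarrow Y, \quad \text{as } n \to \infty, \tag{26}$$

which can be verified as follows. Note that, due to (20), for each fixed $x > 0$

$$P\left( \frac{Y_n}{g(n)} \le x \right) = \frac{E\left[ 1_{\mathcal{E}_n} P\left( Y^n/g(n) \le x \mid \mathcal{F}_{\tau_n} \right) \right]}{P(\mathcal{E}_n)}.$$

Backward martingale convergence and measurability $Y \in \mathcal{F}_0$ imply $\lim_n P(Y \le x \mid \mathcal{F}_{\tau_n}) =
P(Y \le x \mid \mathcal{F}_0) = P(Y \le x)$. Combined with (25) and the fact $\liminf_n P(\mathcal{E}_n) > 0$, this gives
$\lim_n P\left( Y_n/g(n) \le x \right) = P(Y \le x)$, for each $x \in \mathcal{D}_C(Y)$, or equivalently, the convergence (26).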

To conclude (10) from (26), it thus suffices to show that $(Y_n/g(n))_{n \ge 1}$ is a uniformly
integrable family. In fact we will now show that this family is uniformly bounded in $L^2$. Due
to (23), we have $P(Y_n \le \tau_1^n) = 1$, and in particular

$$\frac{Y_n}{g(n)}\, 1_{\{\tau_1^n \le g(n)\}} \le 1, \quad \text{almost surely}.$$

Due to claims 4 and 5 of Lemma 10 we know that $g(n) \sim \bar{c} \int_{t_n}^1 u\, v(u)\,du / \int_{t_n}^1 v(u)\,du$ for some
$\bar{c} = 1/C \in (0, \infty)$. Therefore

$$\frac{Y_n}{g(n)} \le C \cdot \frac{\int_0^{\tau_1^n} u N^{\Lambda,n}(u)\,du}{\int_{t_n}^1 u\, v(u)\,du} \cdot \frac{\int_{t_n}^1 v(u)\,du}{\int_0^{\tau_1^n} N^{\Lambda,n}(u)\,du}\, 1_{\{\tau_1^n > g(n)\}} + 1. \tag{27}$$

Denote by $A_n$ (resp. $B_n$) the first (resp. second) ratio on the right-hand side of (27), so that
$Y_n/g(n) \le C A_n \cdot B_n 1_{\{\tau_1^n > g(n)\}} + 1$. We will bound separately the terms $A_n$ and
$B_n 1_{\{\tau_1^n > g(n)\}}$. Let $b_n := \min(1 - t_n, \tau_1^n) \le \tau_1$ (note that $b_n \to 1 \wedge \tau_1$ a.s.), choose some $k_0$
such that $t_{k_0} < 1/2$, and henceforth assume WLOG that $n \ge k_0$. Then $t_n < 1/2$ so that

$$A_n \le \frac{\int_0^{b_n} u N^{\Lambda,n}(u)\,du}{\int_0^{b_n} (u + t_n)\, v(u + t_n)\,du} + \frac{\int_{b_n}^{\tau_1^n} u N^{\Lambda,n}(u)\,du}{\int_{1/2}^{1} u\, v(u)\,du}.$$

For $h_1, h_2$ two strictly positive integrable functions over some interval $[a, b]$ we always have
that $\int_a^b h_1 / \int_a^b h_2 \le \sup_{u \in [a,b]} h_1(u)/h_2(u)$. Therefore

$$A_n \le \sup_{u \in [0, b_n]} \frac{N^{\Lambda,n}(u)}{v(t_n + u)} + O\left( \int_{b_n}^{\tau_1^n} u N^{\Lambda,n}(u)\,du \right) \le \sup_{u \in [0,1]} \frac{N^{\Lambda,n}(u)}{v(t_n + u)} + O\left( N^{\Lambda}(1/2)(\tau_1)^2 \right), \tag{28}$$

due to $b_n \le 1$ and $N^{\Lambda,n}(b_n) = 1_{\{b_n = \tau_1^n \le 1 - t_n\}} + N^{\Lambda,n}(1 - t_n) 1_{\{b_n = 1 - t_n \ge 1/2\}} \le N^{\Lambda}(1/2)$, a.s.
Similarly we have

$$B_n 1_{\{\tau_1^n > g(n)\}} = \frac{\int_0^{\tau_1^n} v(u + t_n)\,du}{\int_0^{\tau_1^n} N^{\Lambda,n}(u)\,du} \cdot \frac{\int_0^{1 - t_n} v(u + t_n)\,du}{\int_0^{\tau_1^n} v(u + t_n)\,du}\, 1_{\{\tau_1^n > g(n)\}} \le \sup_{u \in [0, \tau_1^n]} \frac{v(t_n + u)}{N^{\Lambda,n}(u)} \cdot \frac{\int_{t_n}^{1} v(u)\,du}{\int_{t_n}^{t_n + g(n)} v(u)\,du}.$$

Observe that due to $t_n \sim c\, n^{1-\alpha}$ (claim 2 of Lemma 10) we have $t_n = o(g(n))$ if $\alpha \ge 3/2$,
and $t_n/g(n) \to c \in (0, \infty)$ if $\alpha < 3/2$. Due to the asymptotic form (17) for $v$, the sequence
$\left( \int_{t_n}^1 v(u)\,du / \int_{t_n}^{t_n + g(n)} v(u)\,du \right)_{n \ge k_0}$ is uniformly bounded. We conclude that

$$B_n 1_{\{\tau_1^n > g(n)\}} = O\left( \sup_{u \in [0, \tau_1^n]} \frac{v(t_n + u)}{N^{\Lambda,n}(u)} \right). \tag{29}$$

Combining (27)-(29), we obtain

$$\frac{Y_n}{g(n)} = O\left( \left[ \sup_{u \in [0,1]} \frac{N^{\Lambda,n}(u)}{v(t_n + u)} + N^{\Lambda}(1/2)(\tau_1)^2 \right] \left[ \sup_{u \in [0, \tau_1^n]} \frac{v(t_n + u)}{N^{\Lambda,n}(u)} \right] \right) + 1, \quad \forall n \ge k_0.$$

By the Cauchy-Schwarz inequality, it suffices to show that, for all $n$ sufficiently large, each
factor in the brackets above is bounded in $L^4$. This follows immediately from:

(i) [4, Eq. (38)] applied with $s = 1$ and $n \ge n_0 \vee k_0$ (see above (33) in [4] for the definition
of $r(x; s)$, Lemma 19 in [4] for the choice of $n_0$, and note that this step uses the condition
$\Lambda[1 - \eta, 1] = 0$ for some $\eta > 0$),

(ii) $N^{\Lambda}(1/2) \in L^p$ for all $p \ge 1$ (a consequence of Theorem 2 in [4]),

(iii) claim 3 of Lemma 10.

This proves the uniform integrability of $(Y_n/g(n))_{n \ge 1}$, and completes the proof of (10).

2.2 Proof of Theorem 3

Recall the construction of Section 1.1, where the genealogy with mutations is realized for all
$n$ simultaneously, with nice monotonicity properties. In this and the next subsection we will
often refer to it under the name the full genealogy (construction or coupling). Furthermore,
we are going to prove Theorems 3 and 6 under the additional assumption

$$\exists \eta > 0 : \Lambda[1 - \eta, 1] = 0,$$

and we will then explain how this hypothesis can be relaxed in Section 2.4.

Case $X_n = S_n$. For each $s > 0$ we have, due to Theorem 5 in [4],

$$\lim_{n\to\infty} \frac{\int_0^s N^{\Lambda,n}(t)\,dt}{\int_0^s v(t_n + t)\,dt} = 1, \quad \text{in probability}. \tag{30}$$

For the Kingman and the regular variation coalescents (see Definition 2) the above
convergence holds almost surely. The total length of $\mathcal{T}_n$ is $L_n = \int_0^{\tau_1^n} N^{\Lambda,n}(t)\, dt$. Observe that
$\int_1^{\tau_1^n} N^{\Lambda,n}(u)\,du / \int_0^1 v(t_n + t)\,dt \to 0$ almost surely since in all cases $\left| \int_1^{\tau_1^n} N^{\Lambda,n}(u)\,du \right| \le
\left| \int_1^{\tau_1} N^{\Lambda}(u)\,du \right| < \infty$ a.s., and $\int_0^1 v(t_n + t)\, dt$ diverges in $n$. Applying (30) with $s = 1$, we
deduce that

$$L_n \sim \int_0^1 v(t_n + t)\, dt$$

in probability (i.e., the ratio of the two sides tends to 1 in probability), and almost surely in
the regular variation case. Using the facts that $v(t_n) = n$ and $v'(q) = -\psi(v(q))$ for all $q > 0$,
and applying a change of variables $q = v(t)$, we obtain

$$\int_0^1 v(t_n + t)\,dt = \int_{v(1 + t_n)}^{n} \frac{q}{\psi(q)}\,dq \sim \int_{v(1)}^{n} \frac{q}{\psi(q)}\,dq \sim \int_1^n \frac{q}{\psi(q)}\, dq, \quad \text{as } n \to \infty,$$

since $v(1 + t_n) \to v(1) \in (0, \infty)$, and since the integral of $q/\psi(q)$ is finite (resp. infinite) over
$[a, b]$ (resp. $[a, \infty)$), for all fixed $a, b \in (0, \infty)$.
Recall again the fact that $S_n$ has Poisson($\theta L_n$) distribution, given $\mathcal{T}_n$. Now due to
$L_n \to \infty$, almost surely, we obtain

$$\frac{S_n}{\int_1^n q\,\psi(q)^{-1}\,dq} \longrightarrow \theta, \tag{31}$$

in probability, as claimed. In the regular variation case, this last convergence holds again in
the almost sure sense due to the fact that in the full genealogy coupling $S_n \le S_{n+1}$, for all $n$,
almost surely, and that $\int_1^n q\,\psi(q)^{-1}\,dq$ is asymptotic to a multiple of $n^{2-\alpha}$. To obtain the final
claim, we recall (3). Integrating the RHS and recalling (31), we deduce that $S_n \sim \theta B n^{2-\alpha}$,
almost surely, where $B$ is as stated in Theorem 3, in consistence with Theorem 1.9 of [6]. □
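
Since the number of segregating sites is Poisson with mean $\theta$ times the tree length given the tree, its expectation is $\theta$ times the expected tree length. For a finite sample the expected tree length can be computed exactly by first-step analysis on the block-counting chain. The sketch below does this under the assumption that $\Lambda = \mathrm{Beta}(2-\alpha,\alpha)$, so that a fixed $k$-tuple among $b$ blocks merges at rate $B(k-\alpha, b-k+\alpha)/B(2-\alpha,\alpha)$. This measure and all function names are choices made for the example.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Euler Beta function B(a, b), for a, b > 0."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def total_merger_rates(b: int, alpha: float) -> dict[int, float]:
    """Map k -> C(b, k) * lambda_{b,k}: total rate of k-mergers among b blocks.

    Assumes Lambda = Beta(2 - alpha, alpha), an illustrative choice.
    """
    log_norm = log_beta(2.0 - alpha, alpha)
    rates = {}
    for k in range(2, b + 1):
        log_binom = lgamma(b + 1) - lgamma(k + 1) - lgamma(b - k + 1)
        rates[k] = exp(log_binom + log_beta(k - alpha, b - k + alpha) - log_norm)
    return rates


def expected_tree_lengths(n: int, alpha: float) -> list[float]:
    """Return a list `length` with length[m] = expected tree length for a sample of size m.

    Index 0 is unused and length[1] = 0.
    """
    length = [0.0, 0.0]
    for b in range(2, n + 1):
        rates = total_merger_rates(b, alpha)
        total = sum(rates.values())
        value = b / total  # b lineages are present during an Exp(total) holding time
        for k, rate in rates.items():
            value += (rate / total) * length[b - k + 1]  # a k-merger leaves b - k + 1 blocks
        length.append(value)
    return length


lengths = expected_tree_lengths(50, 1.5)
```

Multiplying `lengths[m]` by $\theta$ gives the expected number of segregating sites for a sample of size $m$ in this model, which can then be compared with the first-order prediction $\theta B m^{2-\alpha}$; the agreement is only asymptotic, so some discrepancy at moderate $m$ is to be expected.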

For $X_n = A_n$, our strategy is as follows: we first establish the convergence in probability
of $A_n$ in the general case, and then show the almost sure convergence in the strong regular
variation case.

Case $X_n = A_n$, convergence in probability. In the full genealogy construction, we have
$A_n \le S_n + 1$ for each $n$, almost surely. Therefore, (31) implies that for any $\varepsilon > 0$,

$$P\left( A_n > (1 + \varepsilon)\,\theta \int_1^n q\,\psi(q)^{-1}\,dq \right) \longrightarrow 0, \quad \text{as } n \to \infty. \tag{32}$$

It remains to prove the matching lower bound. To do this, for each mutation (or mark) $x$
on $\mathcal{T}_n$, consider the path $\gamma = \gamma(x) \subset \mathcal{T}_n$ defined as follows. Consider a mutation or mark
$x \in \mathcal{T}_n$ with age $t$. Then $\gamma(x)$ is defined as the path connecting the mark to the leaf carrying
the smallest label possible. Since all points of $\gamma$ lie below $x$, the age of any point $y \in \gamma$ is at
most $t$. For example, the $\gamma$ of the mutation encircled in gray on Figure 1 is the path linking it
to the leaf labeled by 1. We say that a mark $x$ is unblocked if $\gamma(x)$ carries no other mutation
than $x$, and otherwise call it blocked. Observe that if $x$ is unblocked then it is guaranteed to
contribute one allelic type to $A_n$. Intuitively, it is rather likely that $x$ is unblocked. Indeed,
since the age $M_n$ of a randomly chosen point on $\mathcal{T}_n$ is typically small, then $e^{-\theta M_n} \approx 1 - \theta M_n$,
so the probability that a typical mutation is blocked is of order $\theta E(M_n) \to 0$. This suggests
that the proportion of blocked mutations is negligible, which is sufficient to yield the desired
result.

More rigorously, given $T_n$ and $S_n$, the mutations fall on $T_n$ as $S_n$ i.i.d. uniformly chosen
random points. For $1 \le i \le S_n$, let $K_{i,n}$ be the "good" event that the $i$th mutation is
unblocked, and define

$$Y_n = \sum_{i=1}^{S_n} \mathbf{1}_{K_{i,n}},$$

the total number of unblocked mutations. As already argued, we have $Y_n \le A_n \le S_n + 1$
almost surely, so in view of (31) it suffices to prove

$$\lim_{n\to\infty} \frac{Y_n}{S_n} = 1, \quad \text{in probability.} \tag{33}$$

Note that, given $T_n$ and $S_n$, the events $\{K_{i,n}\}_{i=1,\dots,S_n}$ are exchangeable. In particular, almost
surely,

$$\mathbb{P}(K_{i,n} \mid T_n, S_n) = \mathbb{P}(K_{1,n} \mid T_n, S_n), \quad i = 1, \dots, S_n.$$

Note in addition that the age of the mutation corresponding to $K_{1,n}$ is equal in distribution
to $M_n$ from Proposition 7. Due to the above discussion, we have

$$\mathbb{P}(K_{1,n}^c \mid T_n, S_n, M_n) \le 1 - \left(1 - \frac{M_n}{L_n}\right)^{(S_n-1)^+} \tag{34}$$



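almost surely on the event $\{S_n > 0\}$. We extend the definition of $K_{1,n}$ using the above on the
complement $\{S_n = 0\}$, making $K_{1,n}$ certain in this case. Fix $\varepsilon > 0$ and note that by Markov's
inequality, we have

$$\mathbb{P}(Y_n < (1-\varepsilon)S_n \mid T_n, S_n) = \mathbb{P}(S_n - Y_n > \varepsilon S_n \mid T_n, S_n) \le \frac{\mathbb{E}(S_n - Y_n \mid T_n, S_n)}{\varepsilon S_n},$$

with the convention $0/0 = 1$. Therefore, due to the above discussion and (34), we obtain

$$\mathbb{P}(Y_n < (1-\varepsilon)S_n) \le \frac{1}{\varepsilon}\,\mathbb{E}\left[\frac{\mathbb{E}(S_n - Y_n \mid T_n, S_n)}{S_n}\right] = \frac{1}{\varepsilon}\,\mathbb{E}\left[\mathbb{P}(K_{1,n}^c \mid T_n, S_n, M_n)\right] \le \frac{1}{\varepsilon}\,\mathbb{E}\left[1 - \left(1 - \frac{M_n}{L_n}\right)^{(S_n-1)^+}\right]. \tag{35}$$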



The random variable (conditional probability) in this last expectation is bounded by 1, almost
surely. Therefore, in order to show that it converges to 0 in the mean (in $L^1$), it suffices to
show that it converges to 0 in probability. Now note that, since $1 - (1-x)^n \le nx$ for $n \in \mathbb{N}$
and $x \in [0,1]$,

$$1 - \left(1 - \frac{M_n}{L_n}\right)^{(S_n-1)^+} \le 2\theta M_n \mathbf{1}_{\{S_n/L_n \le 2\theta\}} + \mathbf{1}_{\{S_n/L_n > 2\theta\}}. \tag{36}$$

So, for a fixed small $\delta > 0$, we have

$$\mathbb{P}\left(1 - \left(1 - \frac{M_n}{L_n}\right)^{(S_n-1)^+} > \delta\right) \le \mathbb{P}(2\theta M_n > \delta) + \mathbb{P}(S_n > 2\theta L_n). \tag{37}$$



Due to Proposition 7 (a), the first term on the RHS in (37) vanishes as $n \to \infty$, and since $S_n$
has Poisson (rate $\theta L_n$) distribution, given $L_n$, the second term also vanishes. Therefore (35)
converges to 0 as $n \to \infty$, implying (33). □
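
The deterministic inequality behind (36) is easy to test on concrete numbers. The following sketch uses hypothetical values for $\theta$, the age $m$ of the mark, the tree length and the number of marks $s$; the function names `blocked_prob` and `bound_36` are illustrative and not from the text.

```python
def blocked_prob(m, length, s):
    """1 - (1 - m/length)^{(s-1)^+}: chance that one of the other s - 1
    uniform marks falls on a path of length m in a tree of total length `length`."""
    return 1.0 - (1.0 - m / length) ** max(s - 1, 0)

def bound_36(m, length, s, theta):
    """Right-hand side of (36): 2*theta*m on {s/length <= 2*theta}, else 1."""
    return 2.0 * theta * m if s / length <= 2.0 * theta else 1.0

theta = 0.5  # hypothetical mutation rate
cases = [(0.01, 50.0, 40), (0.2, 50.0, 45), (0.2, 50.0, 80), (1.0, 10.0, 0)]
for m, length, s in cases:
    print(m, length, s, blocked_prob(m, length, s), bound_36(m, length, s, theta))
```

In each case the blocking probability stays below the bound: when the mark density $s/\text{length}$ is at most $2\theta$ the bound is proportional to the age $m$, and otherwise it is the trivial bound 1.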
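
Case $X_n = A_n$ with strong $\alpha$-regular variation. Here we use a variation of (36):

$$1 - \left(1 - \frac{M_n}{L_n}\right)^{(S_n-1)^+} \le \left(M_n \cdot \frac{S_n}{L_n}\right) \wedge 1 \le 2\theta M_n \mathbf{1}_{\{S_n/L_n \le 2\theta\}} + \left(M_n \wedge \frac{1}{2\theta}\right) \cdot \frac{S_n}{L_n}\,\mathbf{1}_{\{S_n/L_n > 2\theta\}}. \tag{38}$$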

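Therefore, applying the Cauchy-Schwarz inequality,

$$\mathbb{E}\left[1 - \left(1 - \frac{M_n}{L_n}\right)^{(S_n-1)^+}\right] \le 2\theta\,\mathbb{E}(M_n) + \sqrt{\mathbb{E}\left[(M_n \wedge 1/(2\theta))^2\right] \cdot \mathbb{E}\left[(S_n/L_n)^2\right]}.$$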



Due to Proposition 7 (b), the first term above is $O(n^{-2\delta})$ for some $\delta > 0$. For the second one,
note that $\mathbb{E}[(M_n \wedge 1/(2\theta))^2] \le \mathbb{E}(M_n)/(2\theta) = O(n^{-2\delta}/\theta)$ and that $\mathbb{E}[(S_n/L_n)^2] = O(\theta^2 \vee \theta)$.
Indeed, since $S_n$ is Poisson (rate $\theta L_n$), given $L_n$, we have (assuming $n \ge 3$)

$$\mathbb{E}[(S_n/L_n)^2] = \mathbb{E}\left[\mathbb{E}(S_n^2 \mid L_n)/L_n^2\right] = \mathbb{E}[\theta^2 + \theta/L_n] \le \theta^2 + \theta\,\mathbb{E}(1/L_3),$$

where it is simple to verify that $\mathbb{E}(1/L_3) < \infty$. As a consequence, $\mathbb{E}\big(1 - (1 - M_n/L_n)^{(S_n-1)^+}\big) = O(n^{-\delta})$. Due to (35), we deduce

$$\mathbb{P}(Y_n < (1-\varepsilon)S_n) \le \frac{c}{\varepsilon\, n^{\delta}} \tag{39}$$

for some $\delta > 0$, and $c < \infty$ which depends only on $\theta$. Consider the subsequence $n_k = \lfloor k^{2/\delta} \rfloor$,
$k \ge 1$. By the Borel-Cantelli lemma and (39), $Y_n/S_n$ tends to 1 along the subsequence $(n_k)$.
Moreover, since both $A_n \le A_{n+1}$ and $S_n \le S_{n+1}$, for all $n$, almost surely, we have

$$\frac{A_{n_k}}{S_{n_{k+1}}} \le \frac{A_n}{S_n} \le \frac{A_{n_{k+1}}}{S_{n_k}} \quad \text{whenever } n \in [n_k, n_{k+1}] \text{ for some } k \ge 1. \tag{40}$$

Since we already verified at the beginning of the argument that $S_{n_k}/S_{n_{k+1}} \sim (n_k/n_{k+1})^{2-\alpha}$,
almost surely, and since $(n_k/n_{k+1})^{2-\alpha} \to 1$, as $k \to \infty$, the almost sure convergence along the
subsequence $(n_k)_{k \ge 1}$ and (40) imply that $A_n/S_n \to 1$ almost surely. This finishes the proof
of Theorem 3. □
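
The two properties of the subsequence that drive this argument can be illustrated numerically: the bounds $c\,n_k^{-\delta}/\varepsilon$ are summable in $k$ (so Borel-Cantelli applies), and consecutive terms have ratio tending to 1 (so monotonicity fills the gaps). The sketch below uses hypothetical exponents $\delta = 0.5$ and $\alpha = 1.5$; the function name `subsequence` is illustrative.

```python
import math

def subsequence(delta, k_max):
    """n_k = floor(k^(2/delta)) for k = 1, ..., k_max."""
    return [math.floor(k ** (2.0 / delta)) for k in range(1, k_max + 1)]

delta, alpha = 0.5, 1.5  # hypothetical exponents
n = subsequence(delta, 2000)

# Partial sum of n_k^(-delta); each term is at most 2^delta * k^(-2),
# so the full series converges.
partial_sum = sum(nk ** (-delta) for nk in n)

# (n_k / n_{k+1})^(2 - alpha) for the last two computed terms.
ratio = (n[-2] / n[-1]) ** (2.0 - alpha)
print(partial_sum, ratio)
```

The polynomial spacing $k^{2/\delta}$ is sparse enough for summability, yet dense enough that $n_k/n_{k+1} \to 1$; a geometric subsequence would give summability but not the second property.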

Remark 11. As already mentioned, the above convergence in probability for $X_n = A_n$ is
proved in [23] for a more general class of regular $\Xi$-coalescents, using a compact martingale
argument that accounts for all the mutations in a dynamic way (from the point of view of
coalescent evolution), which can be easily extended to handle randomly (and nicely) varying
mutation rates. However, that approach is not well-suited for obtaining qualitative or
quantitative information about a random (typical) mutation. The present approach could be
used even in the setting without martingale structure, once given the estimates in the form of
Proposition 7. Furthermore, the random mutation analysis enables us to easily identify the
asymptotic behavior of $M_{k,n}$ with that of $F_{k,n}$ (see the end of the proof of Theorem 6).

2.3 Proof of Theorem 6

Recall the setting of Theorem 6. We first concentrate on the result (6) in the case of the
allelic partition, which we restate here for convenience: if $F_{k,n}$ denotes the number of allelic
types in the allelic partition carried by exactly $k$ individuals, then for any fixed $k \ge 1$,

$$\frac{F_{k,n}}{n^{2-\alpha}} \to \theta B\,\frac{(2-\alpha)(\alpha-1)\cdots(\alpha+k-3)}{k!}, \quad \text{a.s.} \tag{41}$$

as $n \to \infty$. The key to proving (41) is to apply Corollary 21 in [16], which could be thought
of as a Tauberian theorem for random exchangeable partitions, and which establishes the mutual
equivalence between the strong almost sure asymptotics (5), (6), and (7).

We now recall the setting in [16]. Let $p = (p_1, p_2, \dots)$ be a deterministic sequence such
that $p_1 \ge p_2 \ge \cdots \ge 0$ and $\sum_i p_i = 1$. Suppose that $\Theta = \Theta^p$ is an exchangeable random
partition on $\mathbb{N}$, obtained by performing the paintbox construction generated by $p$ (see, e.g., [1]
or Definition 1.2 in [8]). Let $\Theta_n$ denote the restriction of $\Theta$ onto $[n] = \{1, \dots, n\}$. Let $K_n$ be
the number of blocks in $\Theta_n$, and for each $r = 1, \dots, n$, let $K_{n,r}$ be the number of blocks in
$\Theta_n$ containing exactly $r$ elements. The frequency vector $p$ is said to be regularly varying with
index $\gamma$ if

$$\sum_i \mathbf{1}_{\{p_i > x\}} \sim \ell(1/x)\,x^{-\gamma}$$

as $x \to 0$, where $\ell$ is a slowly varying function.

Lemma 12. (Corollary 21 in [16]) There is equivalence between the following statements.

(a) $p$ is regularly varying with index $\gamma$.

(b) $K_n \sim \Gamma(1-\gamma)\,n^{\gamma}\,\ell(n)$, almost surely as $n \to \infty$.

If either (a) or (b) holds, then for each fixed $r \ge 1$,

$$K_{n,r} \sim \frac{\gamma\,\Gamma(r-\gamma)}{r!}\,n^{\gamma}\,\ell(n), \quad \text{almost surely.}$$

We refer the reader to Theorem 1.11 in [8] for an overview and a sketch of proof, and
to Schweinsberg [31] for a version of this result where the assumptions and conclusions hold
in probability, rather than almost surely.

Proof of Theorem 6. We apply the above lemma to the allelic partition $\Theta$, which is an
exchangeable random partition. As the reader is about to see, for this particular application
the almost sure convergence in (5) of Theorem 3 is crucial. Since $\Theta$ is random exchangeable,
by Kingman's representation theorem (Theorem 1.1 in [8]), all the blocks of $\Theta$ have a
well-defined asymptotic frequency. We let $P$ be the sequence of block frequencies in decreasing
order. Thus $P \in V_{\le 1} = \{(p_1, p_2, \dots) : p_1 \ge p_2 \ge \cdots \ge 0, \sum_{i=1}^\infty p_i \le 1\}$. Moreover,
given $P$, $\Theta$ has the law of a paintbox partition derived from $P$. Note that $A_n$ then corresponds
to the total number of blocks of $\Theta_n$, while $F_{k,n}$ is the number of blocks of size exactly
$k$. In particular, since $A_n = o(n)$ almost surely, it must be that $\mathbb{P}(P \in V_1) = 1$, where
$V_1 = \{(p_1, p_2, \dots) : p_1 \ge p_2 \ge \cdots \ge 0, \sum_{i=1}^\infty p_i = 1\}$. That is, $\Theta$ has no singletons (or no
"dust") almost surely. Moreover, since $\mathbb{P}(A_n \sim \theta B n^{2-\alpha}) = 1$ by (5), then also

$$\mathbb{P}(A_n \sim \theta B n^{2-\alpha} \mid P) = 1, \quad \text{a.s.}$$

Therefore, Corollary 21 in [16] implies that

$$\mathbb{P}(P \text{ is regularly varying with index } 2-\alpha) = 1,$$

and, moreover, that if $N(x) = \sum_{i \ge 1} \mathbf{1}_{\{P_i > x\}}$, then

$$N(x) \sim \frac{\theta B}{\Gamma(\alpha-1)}\,x^{\alpha-2}, \quad \text{almost surely.}$$

Furthermore,

$$\mathbb{P}\left(F_{k,n} \sim \frac{\theta B\,(2-\alpha)\,\Gamma(k-2+\alpha)}{k!\,\Gamma(\alpha-1)}\,n^{2-\alpha} \,\Big|\, P\right) = 1, \quad \text{a.s.}$$

Taking expectations in the last identity yields (41).

It remains for us to prove (6) in the case where $X_{k,n} = M_{k,n}$, the number of genetic types
under the infinite sites model. Observe another important property of our full genealogy
coupling (cf. Figure 1): a family of size $k$ in the infinite allele model necessarily descends from
the same mutation, and therefore this mutation affects at least $k$ leaves. Thus, for all $n \ge 1$,
and for all fixed $k \in \{1, \dots, n\}$, we have that

$$\bar{F}_{k,n} \le \bar{M}_{k,n}, \tag{42}$$

where $\bar{F}_{k,n} = \sum_{j=k}^{n} F_{j,n}$ and $\bar{M}_{k,n} = \sum_{j=k}^{n} M_{j,n}$ are the cumulative numbers of families of size
larger than or equal to $k$. Let

$$c_k = \frac{(2-\alpha)(\alpha-1)\cdots(\alpha+k-3)}{k!},$$

so that $F_{k,n} \sim n^{2-\alpha}\theta B c_k$. Observe that $\bar{F}_{1,n} = A_n$ and thus (since $\bar{F}_{k+1,n} = \bar{F}_{k,n} - F_{k,n}$) we
deduce by induction on $k \ge 1$ that $\bar{F}_{k,n} \sim n^{2-\alpha}\theta B \bar{c}_k$, where $\bar{c}_k = \sum_{j=k}^{\infty} c_j$. Here we use the
fact $\bar{c}_1 = 1$ (see [7], Lemma 30 or [27], display (3.38)).
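
The constants $c_k$ and their tail sums can be evaluated directly. The sketch below writes $c_k = (2-\alpha)\Gamma(k+\alpha-2)/(k!\,\Gamma(\alpha-1))$ and uses the telescoping identity $\bar{c}_k - \bar{c}_{k+1} = c_k$ with $\bar{c}_k = \Gamma(k+\alpha-2)/((k-1)!\,\Gamma(\alpha-1))$, which follows from $\Gamma(x+1) = x\Gamma(x)$; this closed form is a worked consequence of the definitions rather than a formula quoted from the text, and the function names `c` and `c_bar` are illustrative.

```python
import math

def c(k, alpha):
    """c_k = (2 - alpha) * Gamma(k + alpha - 2) / (k! * Gamma(alpha - 1))."""
    log_ratio = (math.lgamma(k + alpha - 2.0)
                 - math.lgamma(k + 1.0)
                 - math.lgamma(alpha - 1.0))
    return (2.0 - alpha) * math.exp(log_ratio)

def c_bar(k, alpha):
    """Tail sum of c_j over j >= k, via the telescoping closed form
    Gamma(k + alpha - 2) / ((k - 1)! * Gamma(alpha - 1))."""
    return math.exp(math.lgamma(k + alpha - 2.0)
                    - math.lgamma(float(k))
                    - math.lgamma(alpha - 1.0))

alpha = 1.5  # hypothetical exponent in (1, 2)
print([round(c(k, alpha), 6) for k in range(1, 6)])
print([round(c_bar(k, alpha), 6) for k in range(1, 6)])
```

In particular $\bar{c}_1 = 1$, and $\bar{c}_k$ decays like a constant times $k^{-(2-\alpha)}$, so the tail sums tend to 0 as $k \to \infty$.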

Therefore, Theorem 6 will be proved, provided we show that, for each fixed $k \ge 1$,

$$\bar{M}_{k,n} \le \bar{F}_{k,n} + o(n^{2-\alpha}), \tag{43}$$

almost surely as $n \to \infty$. This can be done by the following adaptation of the argument for
Theorem 3.

Fix $k \ge 1$. We extend the definition of an unblocked mutation as follows.

Recall that any point $x \in T_n$ corresponds uniquely to a block $B$ of the coalescing partition
$\Pi$ at time $t$, where $t$ is the age of $x$. Suppose that $B = \{i_1 < \dots < i_m\}$, for some $m \in \mathbb{N}$, and define
$T(x) \subset T_n$ to be the restriction of the coalescence subtree generated by the paths that lead
from $x$ to the leaves labeled by $i_1, \dots, i_{m \wedge k}$. For example, for the mutation encircled in black
on Figure 1, this subtree has four leaves labeled by $\{2, 3, 5, 6\}$. Note furthermore that, for
$k = 1$, $T(x)$ coincides with the path $\gamma(x)$ defined in the proof of Theorem 3 and that the
total length of $T(x)$ cannot exceed $kt$.

Let us say that $x \in T_n$ is $k$-unblocked if $T(x)$ carries no other mark than $x$, and otherwise
call it $k$-blocked. Similarly to the proof of Theorem 3, define $K_{i,n}$ as the event that the $i$th
mutation (picked at random without replacement) is $k$-unblocked, and let $Y_{k,n} := \sum_{i=1}^{S_n} \mathbf{1}_{K_{i,n}}$.
Reasoning as for (35), and using the fact that the length of $T$ which corresponds to the
randomly picked mutation is at most $kM_n$, we obtain, for all fixed $\varepsilon > 0$,

$$\mathbb{P}(Y_{k,n} < (1-\varepsilon)S_n) \le \frac{1}{\varepsilon}\,\mathbb{E}\left[1 - \left(1 - \frac{kM_n}{L_n}\right)^{(S_n-1)^+}\right].$$

Since $k$ is fixed, the bound (38) with $M_n$ replaced by $kM_n$ will lead to

$$\mathbb{P}(Y_{k,n} < (1-\varepsilon)S_n) \le \frac{c}{\varepsilon\, n^{\delta}} \tag{44}$$

where $\delta$ is as in (39), and $c$ depends only on $\theta$. Therefore, we have as before $Y_{k,n}/S_n \to 1$
almost surely along the subsequence $(n_j)$, where $n_j = \lfloor j^{2/\delta} \rfloor$, $j \ge 1$. In particular,
$S_{n_j} - Y_{k,n_j} = o(n_j^{2-\alpha})$, $j \ge 1$.

Denote by $\bar{M}'_{k,n}$ the number of $k$-unblocked mutations that span at least $k$ leaves. For
each such mutation, the corresponding family in the allelic partition is of size at least $k$, so
$\bar{M}'_{k,n} \le \bar{F}_{k,n}$. Moreover,

$$0 \le \bar{M}_{k,n} - \bar{M}'_{k,n} \le S_n - Y_{k,n}$$

since $S_n - Y_{k,n}$ accounts for all the $k$-blocked mutations, even if they span fewer than $k$ leaves.
Thus, due to the previous observations,

$$\bar{M}'_{k,n_j} = \bar{M}_{k,n_j} - o(n_j^{2-\alpha}) \le \bar{F}_{k,n_j} \le \bar{M}_{k,n_j}, \quad j \ge 1, \tag{45}$$

implying (43) with $n_j$ in place of $n$, and in particular $\bar{M}_{k,n} \sim \bar{F}_{k,n}$, along the subsequence
$(n_j)_{j \ge 1}$. Since we already know that $\bar{F}_{k,n} \sim \theta B \bar{c}_k n^{2-\alpha}$ almost surely, and since both $(\bar{F}_{k,n})_{n \ge 1}$
and $(\bar{M}_{k,n})_{n \ge 1}$ are non-decreasing, almost surely, reasoning as in (40) gives

$$\bar{M}_{k,n} \sim \theta B \bar{c}_k n^{2-\alpha},$$

as $n \to \infty$, which finishes the proof of Theorem 6. □

Here is an interesting consequence about the structure of $T_n \cap V$. Let $\#_\ell T(x)$ be the
number of leaves of $T_n$ contained in $T(x)$. If $x$ is a randomly chosen mutation, then denote
its $\#_\ell T(x)$ simply by $\#_\ell T$.

Corollary 13. We have $\lim_{k \to \infty} \lim_{n \to \infty} \mathbb{P}(\#_\ell T \ge k) = 0$.

This claim is weaker than the statement that $\#_\ell T$ is stochastically bounded.

Proof. Given $T_n$, $V$, the probability that $\#_\ell T \ge k$ equals precisely $\bar{M}_{k,n}/S_n$. Due to Theorem
6, the almost sure limit of this is $\bar{c}_k$, defined in the proof above. Since $\bar{M}_{k,n}/S_n \in [0, 1]$, this
convergence is also in $L^1$, and the corollary follows due to $\lim_{k \to \infty} \bar{c}_k = 0$. □

2.4 Relaxing the condition $\Lambda[1-\eta, 1] = 0$

Define $\Lambda^\eta(dx) := \Lambda(dx)\mathbf{1}_{[0,1-\eta]}(x)$. Then it is easy to see (or consult, e.g., [3]) that there exists
a path-wise full genealogy coupling of the corresponding $\Lambda$-coalescent and $\Lambda^\eta$-coalescent, so
that, almost surely, for each $n \ge 1$, $N^{\Lambda,n}(t) = N^{\Lambda^\eta,n}(t)$, for all $t \in [0, T_\eta]$, and $N^{\Lambda,n}(t) \le N^{\Lambda^\eta,n}(t)$
for all $t \ge T_\eta$, where $T_\eta$ is an exponential (rate $\int_{(1-\eta,1]} x^{-2}\,\Lambda(dx) < \infty$) random
variable. By the "full genealogy" coupling, we also mean that the same realization of the
mutation process $V$ is used for both the restriction of $T_n^{\Lambda}$ onto $[0, T_\eta]$, and the restriction of
$T_n^{\Lambda^\eta}$ onto $[0, T_\eta]$, simultaneously for all $n$.

The crucial fact is that then the family of non-negative random variables

$$\max\left\{\int_{T_\eta}^{\infty} N^{\Lambda,n}(t)\,dt,\ \int_{T_\eta}^{\infty} N^{\Lambda^\eta,n}(t)\,dt\right\}, \quad n \ge 1,$$

is bounded from above by a finite random variable, almost surely. Therefore, $L_n^{\Lambda} \sim L_n^{\Lambda^\eta}$ almost
surely, as well as $S_n^{\Lambda} \sim S_n^{\Lambda^\eta}$ and $A_n^{\Lambda} \sim A_n^{\Lambda^\eta}$, again almost surely, as $n \to \infty$. This implies
Theorem 3 in the general case. Similarly, in the above coupling, we have $F_{k,n}^{\Lambda} \sim F_{k,n}^{\Lambda^\eta}$ and
$M_{k,n}^{\Lambda} \sim M_{k,n}^{\Lambda^\eta}$, for each fixed $k$, almost surely as $n \to \infty$, yielding Theorem 6.